This survey describes the major optimization techniques of com-
pilers and groups them into three categories: machine dependent,
architecture dependent, and architecture independent. Machine-
dependent optimizations tend to be local and are performed
upon short spans of generated code by using particular properties 
of an instruction set to reduce the time or space required by a 
program. Architecture-dependent optimizations are global and 
are performed while generating code. These optimizations con- 
sider the structure of a computer, but not its detailed instruction 
set. Architecture-independent optimizations are also global but 
are based on analysis of the program flow graph and the depend-
encies among statements of the source program. The paper also pre-
sents a conceptual review of a universal optimizer that performs
architecture-independent optimizations at source-code level. 

INTRODUCTION

Most computer systems support a multiplicity of program- 
ming languages; for a particular language, the translators or com- 
pilers often exist in three versions. The first version is a small, 
fast compiler, which is for program development and has exten- 
sive diagnostics and debugging aids. The second version is a re- 
entrant conversational compiler, which is used for online 
development of programs and has comprehensive editing 
facilities. The third compiler is the optimizing compiler, which is 
used for translating production programs into efficient object 
code and is larger and slower than the others. In this paper, we 
examine the techniques employed in optimizing compilers and 
make some quantitative comparisons between the programs of 
the optimizing compilers and other compilers. A large number of
the examples and references in the paper are FORTRAN-related,
because FORTRAN is the most widely used production program- 
ming language. 

The history of optimizing compilers dates back at least as 
far as FORTRAN I (1). At that time, most programming was
done in machine language, and a compiler that offered con- 
venience at the expense of machine time would not have been 
acceptable. The following quotation from an International Busi- 
ness Machines Corporation (IBM) specification reveals that the 
convenience of the new language was believed insufficient to 
cause its widespread acceptance. 

...FORTRAN may apply complex, lengthy techniques in 
coding a problem which the human coder would have 
neither the time nor inclination to derive or apply. Thus, in 
many cases, FORTRAN may actually produce a better pro-
gram than the normal human coder would be apt to pro-
duce. (1)


Even in that very first FORTRAN compiler, 25 percent of the 
instructions were for optimization. 

OPTIMIZATION TECHNIQUES 

The following three sections describe various optimization 
techniques that have been used in compilers or have been sug- 
gested for compilers. Very little has been done to classify optimi- 
zations; they are grouped here by function. 

Compiler optimization techniques operate on three levels: 
machine dependent, architecture dependent, and architecture 
independent. Machine dependent is used to describe the 
instruction-level sensitivities of a compiler. Architecture depend- 
ent denotes those parts of a program that relate to the general 
hardware implementation, but not to a specific machine. 
Architecture independent (used in lieu of the more familiar
phrase, machine independent) indicates those aspects of program
formulation that do not depend on a particular computer system
or even on a type of implementation (e.g., pipeline processing).
Optimizations originating in the academic and scientific com- 
munity tend to be global, while, until recently, manufacturers 
have concentrated on local and machine-dependent techniques. 

Machine-Dependent Optimization 

One of the earliest references on compilation techniques 
concerns the Project for the Advancement of Coding Techniques 
(PACT), an experimental compiler. The target machine was the 
IBM 701, and the PACT compiler, described by Miller and 
Oldfield (2), produced code sensitive to the register-placement 
curiosities of that machine. No formal techniques were em- 
ployed; rather, a set of rules was coded in tabular form to control 
code generation. To a large extent, this same technique is appli- 
cable today for machine-dependent code optimization. The 
FORTRAN I compiler contained a sophisticated arithmetic trans- 
lator by Sheridan (3) that performed association and commuta- 
tion to take advantage of the AC/MQ relationship on the 
IBM 704. For example, a string of multiplications and divisions 
was reordered to minimize the number of register transfers 
(exchanges) that had to be performed.

McKeeman (4) proposes a postprocessing technique for 
optimization, which can be considered as a window traversing the 
sequence of generated (unoptimized) code. If the instructions 
visible in the window match one of a number of patterns, the 
code is transformed. In this manner redundant stores, multiplica-
tions by two, and register transfers can easily be optimized.
Bagwell (5) describes a set of clever coding tricks (special cases) 
that may be implemented for almost any machine. Although 
performed during code generation, this is essentially McKeeman's 
approach. These machine-dependent optimizations are the most 
descriptive of available techniques. 
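
The window idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not McKeeman's implementation: the instruction names (LOAD, STORE, MUL, ADD_ACC) and the tuple encoding are hypothetical, the window is two instructions wide, and the sketch assumes straight-line code with no labels or branches between the instructions it inspects.

```python
# Hypothetical one-accumulator instructions encoded as tuples: (opcode, operand).
def peephole(code):
    """Slide a two-instruction window over the code and rewrite known patterns."""
    out = []
    for ins in code:
        prev = out[-1] if out else None
        # Pattern 1: STORE x immediately followed by LOAD x. The accumulator
        # still holds x, so the reload is redundant and is dropped.
        if prev and prev[0] == "STORE" and ins == ("LOAD", prev[1]):
            continue
        # Pattern 2: multiplication by two becomes "add the accumulator to itself".
        if ins == ("MUL", 2):
            out.append(("ADD_ACC",))
            continue
        out.append(ins)
    return out


code = [("LOAD", "A"), ("MUL", 2), ("STORE", "T"), ("LOAD", "T"), ("ADD", "B")]
optimized = peephole(code)
```

A table of patterns, rather than the two hard-coded tests used here, would be the natural extension for a real instruction set.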

Architecture-Dependent Optimization 

Three optimization techniques are classified here as archi- 
tecture dependent. These techniques are used for machines 
having one or more of the following general characteristics:

1. The computer has n accumulators.

2. The computer can execute several independent instruc- 
tions in parallel. 

3. The computer executes arithmetic and logical instruc- 
tions upon multiple data streams. 

The evolution of computer architecture has followed a path from 
the single-accumulator IBM 704 to the multiple-accumulator 
CDC 6600, which is capable of executing several instructions in 
parallel, to the ILLIAC IV, which operates on up to 64 data
items simultaneously. Optimization techniques have had a par- 
allel evolution. 

The n Accumulator Computer 

Straightforward code generation of expressions involving 
noncommutative operations poses a special difficulty for a one- 
accumulator computer. In an expression such as (a+b)/(c-d), the 
denominator should be computed first to be available for division 
when the numerator is computed and is in the accumulator. 
Anderson (6) discusses a technique that implements this proce- 
dure and eliminates the need to store and recover the values of 
many subexpressions. Anderson’s technique for a one- 
accumulator computer looks ahead and delays code generation 
for the left-side expression of a noncommutative operator until 
code generation for the right side occurs. On a multiple-
accumulator machine, the technique is also valuable because it
decreases the number of registers required to evaluate an 
expression. 

Nakata (7) extends this procedure to handle n accumula- 
tors. The procedure is enhanced by the fact that some heuristic 
observations are included to make the output similar to ordinary 
coding practices. The programming problem of using a minimum 
number of accumulators is equivalent to a graph-theoretic tree 
transformation proposed by Redziejowski (8). He proposes an 
algorithm for performing the tree transformation and proves it 
equivalent to that of Nakata. A study by Schneider (9) of the 
properties of tree-structure representations of arithmetic expres-
sions yields the number of required registers: For k nested
parenthetical subexpressions with n operator precedence levels,
(k+1)n+1 registers are required.
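
The register requirement of a single expression tree can be illustrated with a short labeling routine. The sketch below uses the later Sethi-Ullman labeling rule as a stand-in; it is not the algorithm of Nakata, Redziejowski, or Schneider. It assumes that the right operand of an instruction may be taken directly from storage, and the tuple encoding of trees is hypothetical.

```python
# Expression trees as nested tuples (op, left, right); leaves are variable names.
def registers_needed(node, is_right_operand=False):
    """Registers needed to evaluate the tree without storing partial results."""
    if isinstance(node, str):
        # A right-hand leaf can be used straight from storage; a left-hand
        # leaf must first be loaded into a register.
        return 0 if is_right_operand else 1
    _, left, right = node
    l = registers_needed(left)
    r = registers_needed(right, True)
    # Evaluate the more demanding side first; a tie costs one extra register.
    return max(l, r) if l != r else l + 1


quotient = ("/", ("+", "a", "b"), ("-", "c", "d"))
chain = ("*", ("*", ("*", "a", "b"), "c"), "d")
```

Under these assumptions `registers_needed(quotient)` is 2, which matches the observation that a one-accumulator machine must store one of the two subexpressions, while the left-deep `chain` needs only one register.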

Finkelstein (10) describes a technique, deferred store, which 
eliminates much of the unnecessary storing and loading of partial 
results within loops on multiple-register machines. (Register is
used here to indicate either an accumulator or index register.) 
When an assignment statement is executed, the accumulator is 
not actually stored in the result variable. Instead, other registers 
replace those containing data for deferred stores. If a result varia-
ble is to be modified before the deferred store has been per-
formed, the value of the variable is in place and need not be
fetched. The following example indicates a common situation
where the deferred store saves a significant amount of time: 

      n
      Σ  a(i)             DO 1 I=1,N
     i=1                1 SUM = SUM + A(I)


A special case of the n accumulator machine is one where 
the accumulator is the top element of a pushdown stack. The 
Burroughs 5000 and English Electric KDF9 are examples of such 
a machine. Randell and Russell (11) describe a one-pass proce-
dure for translation of arithmetic expressions into a Reverse-
Polish form suitable for a stack machine. An interesting point is
their architecture-independent optimization that calculates a 
constant during compilation when both operands are constants. 
Generalizations of this technique are discussed in the next 
section. 
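
A minimal sketch of translation to Reverse-Polish form with compile-time evaluation of constant operands follows. It works on an already-parsed tree rather than in one pass over the source text, so it illustrates the idea only and is not the Randell and Russell procedure; the tree encoding and the function name are assumptions.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def to_postfix(node):
    """Translate a tree (op, left, right) into a postfix list, folding constants."""
    if not isinstance(node, tuple):
        return [node]                      # variable name or numeric constant
    op, left, right = node
    l, r = to_postfix(left), to_postfix(right)
    both_const = (len(l) == 1 and len(r) == 1
                  and isinstance(l[0], (int, float))
                  and isinstance(r[0], (int, float)))
    # Fold only when both operands are known constants; a constant division
    # by zero is left in the code so that it fails at run time, not here.
    if both_const and not (op == "/" and r[0] == 0):
        return [OPS[op](l[0], r[0])]
    return l + r + [op]


postfix = to_postfix(("+", "A", ("*", 2, 3)))
```

Here the product of the two constants is computed during translation, so the stack machine receives `A 6 +` instead of `A 2 3 * +`.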

Parallel-Instruction Execution 

The CDC 6600 computer was the first commercially avail- 
able machine to overlap the execution of several instructions. 
This class of computer, capable of parallel-instruction execution, 
has independent functional units that operate simultaneously. 
The programmer (or compiler) need not be explicitly aware of 
the parallel-execution capability; it will be used when possible. 
However, if the instructions are ordered to maximize parallel 
execution, a performance advantage of up to a factor of three 
can be obtained. Allard, Wolf, and Zemlin (12) describe the par- 
allel capabilities of the CDC 6600 and briefly mention the speed 
advantage gained by reordering instructions. Thorlin (13) de-
scribes the technique used in CDC FORTRAN, which is based on
a PERT-like analysis of dependency and timing for ordering 
CDC 6600 instructions. The machine’s independent functional
units are kept busy by placing unrelated instructions together 
and sequencing the longest activities first. A similar instruction-
scheduling technique was implemented by Blum et al. (14) in
the IBM FORTRAN H compiler. It constructs a dependency 
array that defines the area within which each instruction may be 
moved. A weight is assigned to each instruction by adding a base 
weight (a function of the instruction time) and the weight of 
every instruction which is dependent on it; instructions are then 
ordered by decreasing weight. 
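
The weighting rule just described can be sketched as follows. The instruction names, base weights, and dependency lists are invented for illustration, and the sketch reads "every instruction which is dependent on it" as the direct dependents, applied recursively; the compiler's actual tables are not given in the source.

```python
from functools import lru_cache

# Hypothetical block: instruction -> (base weight, instructions that depend on it).
block = {
    "load_a": (2, ["mul"]),
    "load_b": (2, ["mul"]),
    "mul":    (10, ["add"]),
    "load_c": (2, ["add"]),
    "add":    (4, []),
}


@lru_cache(maxsize=None)
def weight(name):
    """Base weight plus the weight of every instruction that depends on this one."""
    base, dependents = block[name]
    return base + sum(weight(d) for d in dependents)


# Heaviest first; Python's sort is stable, so ties keep their original order.
order = sorted(block, key=weight, reverse=True)
```

Because every base weight is positive, an instruction always outweighs its dependents, so ordering by decreasing weight never places an instruction ahead of one it depends on.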

However, the improvement attainable by instruction re- 
ordering is limited by the parallelism inherent in the original 
instruction sequence. Stone (15) summarizes techniques that 
may be used to translate arithmetic expressions to achieve a high 
degree of inherent parallelism. His process corresponds to a tree 
structure with minimal height. This is the opposite result from 
that of the n accumulator, where minimizing the number of regis- 
ters increased the tree height. Figure 1 shows two different trees 
for evaluating an expression. The first tree employs only one 
register for evaluating the expression; the second tree results in 
minimum evaluation time on a machine with at least four simul- 
taneous multipliers. Because of the effects of data store/load in- 
structions, the second tree may not result in minimum time on 
machines with fewer multipliers. Figure 2 shows two different 
sets of instructions for evaluating the expression corresponding 
to the two trees, which result in serial and parallel execution. 



a) TREE YIELDING MINIMUM NUMBER OF REGISTERS
b) TREE YIELDING MAXIMUM INHERENT PARALLELISM

Figure 1. Tree Structure for Serial and Parallel
Computation of an Expression

a) SERIAL EXECUTION, TIME = 22 CYCLES
b) PARALLEL EXECUTION, TIME = 18 CYCLES


Figure 2. Resonances Used in Serial and Parallel 
Computation of an Expression 


Han (16) examines the general problem of minimizing the 
tree height of a set of expressions. His procedure for determin- 
ing an expression’s minimum tree height can be used by a 
compiler to measure the degree of parallelism obtainable in a 
program. Ramamoorthy and Gonzalez (17) describe a method 
for attaining the maximum amount of parallel execution on a 
machine with a fixed number of processing units. Their method 
orders subexpressions so that some expressions can be delayed if 
insufficient processing units are available to perform all compu- 
tations in parallel. 
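
The contrast between a register-minimizing tree and a minimum-height tree can be made concrete with a small sketch. It assumes a single associative operator and an unlimited number of functional units, and the helper names are hypothetical; it is not the procedure of Stone, Han, or Ramamoorthy and Gonzalez.

```python
def chain(operands, op="*"):
    """Left-deep tree ((a*b)*c)*d...: few registers, but every step waits on the last."""
    tree = operands[0]
    for x in operands[1:]:
        tree = (op, tree, x)
    return tree


def balanced(operands, op="*"):
    """Balanced tree over the same operands (valid only for an associative operator)."""
    if len(operands) == 1:
        return operands[0]
    mid = len(operands) // 2
    return (op, balanced(operands[:mid], op), balanced(operands[mid:], op))


def height(tree):
    """Operator levels in the tree: the number of steps when enough units are free."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(height(tree[1]), height(tree[2]))


names = list("abcdefgh")
serial_steps = height(chain(names))
parallel_steps = height(balanced(names))
```

For eight operands the chain has height 7 and the balanced tree height 3. Reassociating floating-point operations can change rounding, which a compiler applying this transformation would have to take into account.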


Multiple Data Streams 

The preceding section discussed some techniques relevant to 
computers whose architecture permitted parallel execution of in- 
structions while maintaining the standard instruction set. This 
section discusses the optimization techniques applicable to com- 
puters where the instruction set reflects the computer’s capacity 
to perform a single instruction on many data items. Two com- 
puters are in this category: the CDC STAR and the Burroughs 
ILLIAC IV. The STAR, a pipeline computer, processes operands 
sequentially, but with a high degree of overlap. The ILLIAC, a 
parallel computer, processes 64 operands simultaneously. The 
instruction sets of the two machines are remarkably similar, and
high-level language programs must be designed from the same 
viewpoint for both machines. 

For programs written in a procedural language, 
Burkhardt (18) describes some occurrences of inherent paral-
lelism. He points out that parallelism may occur from the arith-
metic-expression level, through independent iterations of a loop,
to parallel-task execution within an operating system environ- 
ment. Millstein (19), reporting on the design of a FORTRAN 
compiler for the ILLIAC IV, discusses a compiler that will detect 
parallelism in the use of subscripted variables in DO loops. This 
three-step procedure first determines data dependencies and 
then, if there are more dependencies between loop iterations, 
examines flow within the loop and determines an expression
ordering. The first two steps are analyzed by graph-theoretic
techniques; the last by ad hoc methods. A later report by 
Lamport and Presberg (20) gives a detailed description of the 
algorithms and techniques used to permit parallel execution of 
DO loops. 

Schneck (21) developed a simplified algorithm for the
detection of parallelism in standard FORTRAN programs and 
defined the concept of feedback which prevents parallel execu- 
tion. Testing for feedback involves a flow analysis of the source 
program and a search for subscript forms that cause feedback. In
the absence of feedback, statements are rewritten to indicate 
parallel execution. Additionally, scalar variables that might bar 
parallel execution are expanded to vectors. Thus, the following
loop may be performed entirely in parallel.


      DO 1 I = 1,20
      A = (B(I) + B(I+1)) * .5
      C(I) = C(I) + A
    1 D(I) = D(I) - A
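
The effect of expanding the scalar A to a vector can be checked with a short sketch. The Python below is an illustration only: it uses 0-based subscripts instead of FORTRAN's 1-based ones, the data values are invented, and list comprehensions stand in for the vector operations of a parallel machine.

```python
def serial(B, C, D):
    """The loop as written: the single scalar A is reused by every iteration."""
    C, D = C[:], D[:]
    for i in range(20):
        a = (B[i] + B[i + 1]) * 0.5
        C[i] += a
        D[i] -= a
    return C, D


def expanded(B, C, D):
    """Scalar A expanded to a vector: every element can be computed independently."""
    A = [(B[i] + B[i + 1]) * 0.5 for i in range(20)]
    return ([C[i] + A[i] for i in range(20)],
            [D[i] - A[i] for i in range(20)])


B = [float(i * i) for i in range(21)]
C = [1.0] * 20
D = [2.0] * 20
same = serial(B, C, D) == expanded(B, C, D)
```

Both forms perform the same arithmetic on the same operands, so `same` is true; only the second form is free of the dependence that the shared scalar creates between iterations.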

Kuck, Muraoka, and Chen (22) performed an analysis simi- 
lar to Schneck’s. Their orientation was to define a machine archi- 
tecture to process ordinary programs. They conclude that, even 
for simple programs, a multiple-processor organization, consisting 
of 16 processors, is of value.

Architecture-Independent Optimization 

Architecture-independent optimization techniques are 
global in nature; they perform a flow analysis on the source 
program to obtain necessary information. This section summa- 
rizes major architecture-independent optimizations, which are 
illustrated in Figure 3. The most widely applied optimization is 
common subexpression elimination. When a calculation is per-
formed, a search is made to determine if the calculation was
performed previously and need not be repeated; if so, the prior 
result replaces the calculation. Dead variable elimination removes 
statements that assign values to unused variables in the program. 
These unused variables most frequently result from program 
modifications, but may also be due to common subexpression 
elimination. Code motion refers to the rearrangement of expres- 
sions permitting a calculation to occur in a low-frequency pro- 
gram segment and to be available for use in a high-frequency 
segment. Finally, constant propagation removes calculations con- 
taining only known constants from the program and performs 
them in the compiler. This is certainly code motion to a low- 
frequency segment. 

Frequency Analysis 

The original FORTRAN compiler (23) contained an opti-
mizer that gathered information on the source program’s struc-
ture. The source program was analyzed and broken down into a
set of basic blocks,¹ and a table listing the predecessors of each
basic block was created. This table was then used in a Monte
Carlo simulation to find the relative execution frequency of each
basic block. A random number generator, augmented by pro-
grammer estimates supplied in FREQUENCY statements,² was
used to traverse paths in the program flow graphs, and a count 
was kept for each basic block. Figure 4 shows a program flow 
graph, which is used as an example throughout this section, and 
Table 1 shows the relative frequencies obtained by simulation. 
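
A Monte Carlo estimate of block frequencies can be sketched as a random walk over the flow graph. The three-block graph and its branch probabilities below are hypothetical (they are not the graph of Figure 4), and the code makes no claim to reproduce the FORTRAN I implementation.

```python
import random

# Hypothetical flow graph: block -> list of (successor, branch probability).
graph = {
    1: [(2, 1.0)],
    2: [(2, 0.75), (3, 0.25)],   # block 2 branches back to itself 3 times out of 4
    3: [],                       # exit block
}


def simulate(graph, entry, runs, seed=0):
    """Walk the graph `runs` times and return the average visits per run for each block."""
    rng = random.Random(seed)
    counts = {b: 0 for b in graph}
    for _ in range(runs):
        block = entry
        while True:
            counts[block] += 1
            succs = graph[block]
            if not succs:
                break
            targets, probs = zip(*succs)
            block = rng.choices(targets, weights=probs)[0]
    return {b: counts[b] / runs for b in graph}


freq = simulate(graph, entry=1, runs=20000)
```

The entry and exit blocks are visited exactly once per run, and the estimate for block 2 approaches 1/(1 - 0.75) = 4 as the number of runs grows.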

Next, the source program blocks were optimized in order 
from highest to lowest frequency. The target computer, an 
IBM 704, had only three index registers, and much of the com- 
piler’s optimization centered on assigning them efficiently in the 
most frequently executed blocks of the program. Code genera-
tion was performed for a machine assumed to have as many
index registers as required. Then, the optimizer efficiently as-
signed the 704’s three index registers within the highest fre-
quency block of the program, while interpolating instructions to
save and restore index register values when necessary. As other
blocks were processed, index register assignments were made to
match those of adjacent (immediate predecessor or successor)
higher frequency blocks. When no adjacent blocks had already
been processed, no matching of index registers was necessary.
When just one adjacent block had already been processed, it was
necessary only to choose a matching permutation of the index
registers. If two or more adjacent blocks had already been proc-
essed, the possibility of matching all index registers was uncer-
tain. Therefore, it became necessary to add instructions for
loading index registers with the values required by adjacent
blocks. The progression of processing from high-frequency to
low-frequency blocks caused the added instructions to be located
in the later low-frequency blocks.

¹ A basic block is the fundamental program flow unit; it is a
segment of code with only one entry and one exit point.

² According to John Cocke, the FREQUENCY statement was
removed from the language after it was discovered to have been
incorrectly implemented (frequencies were being computed in-
versely) without having encountered any user reaction.


      SAMPLE PROGRAM                 OPTIMIZED PROGRAM

      A = B + C                      A = B + C
      Y = Y + 1.                                              (a)
      Z = C                                                   (b)
      Q = (Z + B) * SIN(.7854)       Q = A * .7071            (c, d)
      DO 1 I=1,100                   T00001 = A + B           (e)
    1 P(I) = P(I) * (A+B)            DO 1 I=1,100
                                   1 P(I) = P(I) * T00001

a. ELIMINATION OF DEAD VARIABLE
b. ELIMINATION OF DEAD VARIABLE, CAUSED BY c
c. COMMON SUBEXPRESSION ELIMINATION
d. CONSTANT PROPAGATION
e. CODE MOTION

Figure 3. Architecture-Independent Optimization


Table 1. Relative Frequencies Obtained
by Simulation

BLOCK     RELATIVE FREQUENCY
  1              1.0
  2              8.2
  3             33.0
  4             16.5
  5             16.5
  6             28.9
  7              4.1
  8              7.1
  9              1.0



Figure 4. Program Flow Graph with 
Branch Probabilities 


In another study, Horwitz (24) describes a graph-theoretic 
procedure for index-register allocation. An optimal index-register
allocation may be obtained for straight-line (loop-free) programs. 
Horwitz’s algorithm is a practical procedure for carrying out the 
highly combinatorial assignment process similar to that employed 
by FORTRAN I. Also, Luccio (25) provides a further reduction 
of the enumeration required for an optimal allocation. The pro- 
gram graph is partitioned, and the allocation problem may be 
solved separately for each subgraph and then combined. 

Day (26) discusses an alternate linear-programming ap- 
proach for assignment of registers. He demonstrates an optimal 
algorithm and gives two others which provide good approxima- 
tions, but are fast enough for use in a compiler. 

Matrix Analysis 

Prosser (27) describes a Boolean-matrix approach to flow- 
graph analysis that avoids the lengthy Monte Carlo techniques 
used in FORTRAN I. The predecessor information obtained by 
analysis of the program is used to construct a connection matrix 
(Table 2). The connection matrix C has a 1 at C(i,j) if, and only if,
program block j is a direct successor of program block i. By
repeated matrix multiplication, the connection matrix may be 
used to determine the sets of blocks participating in loops. 
Prosser also introduces the dominance relation to indicate that a 
particular block (dominator) must be traversed before another 
block (dominee) can be reached. This construct is extremely 
valuable. When a calculation is moved out of a block, it must be 
moved to a dominator block to assure that it will be performed. 
These matrix operations yield valuable, if lengthy, methods for 
obtaining program-flow information. Warshall (28) describes a
simplification of the multiplication of n x n Boolean matrices
that reduces the time required from O(n^3) to O(n^2), which
makes matrix techniques practical. 
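
The use of a connection matrix to find the blocks that participate in loops can be sketched with Warshall's algorithm. In this sketch each matrix row is held as an integer bit mask, so the inner step is one logical OR per row; the flow graph, the 0-based block numbering, and the function name are assumptions made for illustration.

```python
def closure(succ, n):
    """Warshall's algorithm; bit j of reach[i] is set when a path leads from i to j."""
    reach = [0] * n
    for i, js in succ.items():
        for j in js:
            reach[i] |= 1 << j            # direct successors: the connection matrix
    for k in range(n):
        for i in range(n):
            if (reach[i] >> k) & 1:       # i reaches k, so i reaches all that k reaches
                reach[i] |= reach[k]
    return reach


# Hypothetical flow graph, blocks numbered from 0: 0 -> 1 -> 2 -> 1 and 2 -> 3.
succ = {0: [1], 1: [2], 2: [1, 3], 3: []}
reach = closure(succ, 4)
# A block lies on a loop exactly when it can reach itself.
in_loop = [i for i in range(4) if (reach[i] >> i) & 1]
```

For this graph `in_loop` is `[1, 2]`. When a row fits in one machine word, the two nested loops perform on the order of n squared word operations.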

Table 2. Boolean Connection Matrix C

For some general results on what may be obtained from the
connectivity matrix of a program flow graph, Ramamoorthy (29) 
presents algorithms that identify unessential nodes, enumerate 
the maximum strongly connected regions (i.e. loops), and parti- 
tion the flow graph into disjoint subgraphs. These matrix manip- 
ulations are basic to obtain the information required for program 
optimization. Unessential nodes may be discarded from a pro- 
gram because they will never be executed. The identification of 
the maximum strongly connected regions permits locating rela- 
tive-constant expressions and moving them to lower frequency 
regions, as well as indicating on which blocks the optimization 
process should concentrate. Partitioning the flow graph into dis- 
joint subgraphs permits working with smaller units at a time, 
which results in a significant decrease in the combinatorial efforts 
expended in optimization. 

In a FORTRAN optimizer, Allen (30) uses matrix methods 
for the analysis of a program’s flow graph. The connection ma- 
trix is used to obtain a set of strongly connected regions which 
the optimizer processes from the inside out. Within a basic block, 
redundant expressions are evaluated only once and then elimi- 
nated. Constant propagation is performed, and expressions are 
replaced by their computed values. Within a loop, invariant in- 
structions are moved out, strength reduction is performed, and 
tests are simplified. Unused definitions and computations are 
eliminated where the flow information indicates this is possible. 
The procedures to effect these optimizations take advantage of 
the bit-parallel operations found in most machines and perform 
logical operations, a word at a time. Allen’s article is extremely 
comprehensive and shows the details involved in applying each of 
the optimizing techniques. 

Following Allen’s optimization techniques, Kleir and 
Ramamoorthy (31) describe optimization procedures for micro- 
programs, which may be viewed simply as another source lan- 
guage requiring translation to machine language. The connec- 
tivity matrix is used to find strongly connected regions which are 
processed innermost to outermost, and code motion is performed 
to decrease instruction execution frequency. Within a basic 
block, common-subexpression elimination and dead-variable 
elimination are performed (referred to by the authors as redundant
actions and negated actions, respectively). 

In his book, Gries (32) devotes an entire chapter to a dis- 
cussion of code optimization techniques. He views optimization 
at three levels: within a basic block, within a loop, and globally. 
The global optimization techniques are patterned after Allen. 

In FORTRAN I, subscript calculations were performed at 
any definition of a variable used in the subscript, which led to 
inefficient code when many definitions occurred with few uses.
Most other compilers simply recomputed a subscript for each use 
or kept track of local multiple uses of a subscript (e.g., within an 
assignment statement). In an article, Ryan (33) considers the 
problem of determining where a common subscript (or any com- 
mon expression) may be computed with minimum frequency to 
be available when required. Ryan’s algorithm may be used with a 
multipass compiler and permits locating computations within a 
lower frequency region at a distance from the point of use. 

In the 1960s, the advent of new hardware brought a new 
class of compilers and optimization techniques. In an IBM tech-
nical report, Medlock and Lowry (34) describe the optimization
techniques that are the foundation of the IBM FORTRAN H 
compiler. These techniques extend the dominance relationship 
introduced by Prosser. In addition, a new defining relationship 
makes it possible to replace the dominance array with four vec- 
tors and reduce the space required. Frequency information is 
obtained by inverting the probability connection matrix P, where
P(i,j) indicates the probability that program block j will succeed
block i. Table 3 shows the probability connection matrix for the
example flowchart and its inverse. In Table 4, row one indicates 
frequency relative to block one. 
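
Frequencies can be obtained from branch probabilities without simulation by solving a linear system. The sketch below assumes that the matrix actually inverted is I - P, the standard result for expected visit counts; the source says only that the probability connection matrix is inverted. The three-block matrix is hypothetical and blocks are numbered from 0.

```python
from fractions import Fraction as F


def frequencies(P, entry=0):
    """Solve f = e + f P, that is (I - P)^T f = e, by Gauss-Jordan elimination."""
    n = len(P)
    # Augmented matrix of the system (I - P)^T f = e, where e marks the entry block.
    M = [[(1 if i == j else 0) - P[j][i] for j in range(n)] + [F(int(i == entry))]
         for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                M[r] = [a - M[r][col] * b for a, b in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


# Hypothetical graph: block 0 -> 1; block 1 repeats with probability 3/4, else -> 2.
P = [[F(0), F(1), F(0)],
     [F(0), F(3, 4), F(1, 4)],
     [F(0), F(0), F(0)]]
freq = frequencies(P)
```

The result is the list of frequencies relative to the entry block, here 1, 4, and 1. Exact fractions are used so that the small example is free of rounding error.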


Table 3. Probability Connection Matrix P 

Determination of relative frequencies again permits ordering
the processing of blocks from highest to lowest frequency. With- 
in that order, common subexpressions may be eliminated and 
expressions may be moved from high- to low-frequency blocks.
The dominator relationships are used to ensure availability of an 
expression when needed. When profitable, strength reduction will 
be performed with initialization instructions placed in a domi- 
nator block. Subsumption, or substituting one variable for 
another if they are equal at each reference, minimizes the 
number of simple replacement operations in a program. In 
another report, Lowry and Medlock (35) describe the
FORTRAN H production compiler, discuss the compiler’s imple- 
mentation, and suggest several additional optimizations. An inter- 
esting point about the compiler is that it is written primarily in 
FORTRAN. First run on the IBM 7094, the compiler was used to 
create a new version of itself for the IBM 360. When the 
optimizer had been tested, it was used to translate the compiler, 
which resulted in a 25 percent decrease in size and a 35 percent 
decrease in compilation time. To achieve a reasonable processing 
speed, the compiler uses bit-vectors which can be processed by 
the bit-parallel logical instructions available on the IBM 360. 
Lowry and Medlock also discuss register assignment and code 
generation techniques. 

Graph-Theoretic Analysis 

Busam (36) reviews the UNIVAC 1100 series, which em-
ploys a three-pass optimizing compiler. The first pass encodes all
operations into a uniform tabular format. To achieve maximum 
recognition of common subexpressions, redundant information 
may be added to the table, while flow information is maintained 
in a list containing all statement numbers and references. The 
second pass scans the code in reverse order and performs com- 
mon-subexpression elimination and movement of loop-invariant 
computations. The compute point of each expression is deter- 
mined, and the expression evaluation is moved to that (lower 
frequency) point. In contrast, IBM FORTRAN H will succes- 
sively move a computation out of each loop level until it can no 
longer be moved. The determination of an expression’s compute 
point permits high-speed compilation because an expression need 
not be moved more than once. The second and third passes of 
the compiler are also concerned with register assignment and 
code generation. 

In the USSR, the ALPHA automatic programming system 
(Yershov 37 and 38) for the M-20 computer produces object
code that in some cases is nearly the equal of hand coding. The 
major architecture-independent optimizations include optimiza- 
tion of subscripts within FOR loops and the elimination of 
redundant subexpressions within a basic block. A feature of the 
ALPHA compiler, not found elsewhere, is the attempt to mini- 
mize the number of locations occupied by data within a program. 
While many optimizations result in a decrease in the size of a 
program because fewer instructions are generated, ALPHA specif- 
ically minimizes the storage requirements for data and variables 
by permitting several variables to share a storage location if their 
uses do not interfere with one another. This is a generalization of 
subsumption, which coalesces two variables into one if they con- 
tain the same value whenever used. 

Much of the work in the past two years has centered around 
the use of the Cocke-Allen interval analysis technique, which was
described first by Cocke and Schwartz (39). This technique is 
based on the interval, a partially ordered set of basic blocks with 
the following properties: 

1. An interval is a set function of a distinguished block
called the head. 

2. All blocks in an interval, except the head, have all their 
immediate predecessor blocks in the interval. 

Intervals are easily and rapidly constructed, and they readily 
identify inner loops and a possible processing order within each 
loop. The ordering induced by the interval construction is valu-
able because dominators always precede their dominees.
Common subexpression elimination is simplified in this context 
because the redundancy of a computation is indicated by its 
presence in a dominator block. The interval construction process 
may be iterated (treating intervals as basic blocks), and higher 
order loops will then be identified. Most program graphs will be 
reduced to a single node by repetition of this procedure. Those 
few program graphs which are not reducible may be transformed 
into reducible graphs by the process of node splitting (Cocke and
Miller (40)). Figure 5 shows the intervals obtained from the 
example flowchart and how they may be used to compute 
relative frequencies. 
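
Interval construction can be sketched directly from the two properties listed above. The routine is a plain rendering of the usual textbook formulation, written without the bit-vector machinery a compiler would use; the flow graph and all names are hypothetical.

```python
def intervals(succ, entry):
    """Partition a flow graph into intervals; returns head -> ordered member list."""
    preds = {n: set() for n in succ}
    for n, ss in succ.items():
        for s in ss:
            preds[s].add(n)
    heads, done, result = [entry], set(), {}
    while heads:
        h = heads.pop(0)
        if h in done:
            continue
        members = [h]
        grew = True
        while grew:
            grew = False
            for n in succ:
                # Add a block only when all its immediate predecessors are inside.
                if (n not in members and n not in done and n != entry
                        and preds[n] and preds[n] <= set(members)):
                    members.append(n)
                    grew = True
        done.update(members)
        result[h] = members
        # Any unprocessed block entered from this interval heads a new interval.
        for m in members:
            for s in succ[m]:
                if s not in done and s not in heads:
                    heads.append(s)
    return result


# Hypothetical flow graph: 1 -> 2 -> 3 -> 2 (loop) and 3 -> 4.
succ = {1: [2], 2: [3], 3: [2, 4], 4: []}
found = intervals(succ, entry=1)
```

Block 2 cannot join the interval headed by block 1 because one of its predecessors (block 3) lies outside it, so it heads an interval of its own, which contains the loop.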




[Figure 5 panel: DETERMINATION OF FREQUENCY; P = Σ p(i)]

Figure 5. Interval Analysis and Frequency Determination: Relative
Frequency of a Loop May Be Determined from the Probability of a 
Loop-Closing Branch 


In the proceedings of the Association for Computing Machin-
ery SIGPLAN Symposium on Compiler Optimization,
Allen (41) indicates that over 90 percent of the program graphs 
subject to analysis were reducible. She gives algorithms that 
determine the back dominator of each node in an interval, the 
articulation blocks of an interval, and the maximum strongly 
connected region within an interval. The use of interval analysis 
for global optimization is shown with an example that demon- 
strates how information is relayed through successive iterations 
of the interval construction. In the same proceedings, Cocke (42) 
describes a method for common-subexpression elimination based 
upon interval analysis. The information required to determine 
whether a computation is redundant may be coded as a large 
system of Boolean equations. The interval technique permits 
solution of this system without having to perform a tedious 
Gauss elimination. Only two passes through the system of equa- 
tions are required. A later paper by Allen and Cocke (43) sum- 
marizes techniques for identifying intervals and properties of 
blocks within intervals. The concept of node splitting to permit 
reduction of an arbitrary program flow graph is discussed in 
detail. 
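
To make the system of Boolean equations concrete, the sketch below encodes availability of expressions as bit-vectors, one bit per expression. It solves the equations by simple iteration to a fixed point. This is not Cocke's two-pass interval method, which we do not reproduce here. The names `available_expressions`, `gen`, and `kill`, and the four-block example, are our assumptions.

```python
def available_expressions(blocks, pred, gen, kill, entry, n_exprs):
    """Solve the availability equations by iteration.

    gen[b] and kill[b] are bit-vectors held in Python ints: bit i set
    means block b computes (gen) or invalidates (kill) expression i.
    Returns a dict mapping each block to the bit-vector of expressions
    available on entry to that block.
    """
    full = (1 << n_exprs) - 1
    # Start optimistically: everything available, except at the entry.
    avail_in = {b: full for b in blocks}
    avail_in[entry] = 0
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == entry:
                continue
            new_in = full
            for p in pred[b]:
                # Available on exit from p: computed in p, or available
                # on entry to p and not invalidated by p.
                avail_out = gen[p] | (avail_in[p] & ~kill[p])
                # An expression must arrive along every incoming edge.
                new_in &= avail_out
            if new_in != avail_in[b]:
                avail_in[b] = new_in
                changed = True
    return avail_in


# Hypothetical diamond: A branches to B and C, which rejoin at D.
# Bit 0 stands for "x + y" (computed in A); bit 1 for "u * v" (computed in B).
blocks = ["A", "B", "C", "D"]
pred = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
gen = {"A": 0b01, "B": 0b10, "C": 0b00, "D": 0b00}
kill = {"A": 0b00, "B": 0b00, "C": 0b00, "D": 0b00}
avail = available_expressions(blocks, pred, gen, kill, "A", 2)
```

On entry to D only bit 0 is set: a recomputation of "x + y" there would be redundant, while "u * v" is not available because it reaches D along only one path.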

Kennedy (44) has built upon Allen and Cocke’s work. He 
has divided questions concerning data flow into two classes: 

1. Those referring to the status of variables on entry to a 
block 

2. Those referring to the effect of computations within a 
block on later computations 

Since the first class of problems has been solved, Kennedy gives 
an approach to the second. An algorithm for the identification of 
dead variables is shown, which, like Cocke’s common subexpres- 
sion elimination algorithm, requires only two passes. The two 
passes perform logical operations on bit-vectors, which may be 
performed in parallel on most machines, resulting in a very high- 
speed algorithm. 
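
The bit-vector formulation can be illustrated with a small liveness computation. The sketch below is not Kennedy's two-pass algorithm. It iterates to a fixed point instead, and shows only how the logical operations on bit-vectors identify variables that are dead on exit from a block. The names `live_on_exit`, `dead_on_exit`, `use`, and `define`, and the three-block example, are our assumptions.

```python
def live_on_exit(blocks, succ, use, define):
    """Return a dict: block -> bit-vector of variables live on exit.

    use[b]    : variables read in b before any assignment to them in b.
    define[b] : variables assigned in b.
    Bit-vectors are Python ints, one bit per variable.
    """
    live_in = {b: 0 for b in blocks}
    live_out = {b: 0 for b in blocks}
    changed = True
    while changed:
        changed = False
        # Information flows backward, so visit blocks in reverse order.
        for b in reversed(blocks):
            out = 0
            for s in succ[b]:
                out |= live_in[s]
            new_in = use[b] | (out & ~define[b])
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_out


def dead_on_exit(blocks, succ, use, define):
    """Variables assigned in each block whose value is never needed later."""
    live_out = live_on_exit(blocks, succ, use, define)
    return {b: define[b] & ~live_out[b] for b in blocks}


# Hypothetical straight-line program, with X = bit 0, Y = bit 1, Z = bit 2:
#   B1: X = ...; Y = ...      B2: Z = X + 1      B3: print Z
X, Y, Z = 0b001, 0b010, 0b100
blocks = ["B1", "B2", "B3"]
succ = {"B1": ["B2"], "B2": ["B3"], "B3": []}
use = {"B1": 0, "B2": X, "B3": Z}
define = {"B1": X | Y, "B2": Z, "B3": 0}
dead = dead_on_exit(blocks, succ, use, define)
```

Here Y is reported dead on exit from B1 because no later block reads it. Each step is a single AND, OR, or complement over a whole vector, which is what makes the approach fast.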

The value number technique described by Cocke and 
Schwartz assigns a unique identifier to each calculation in a 
program. Whenever a calculation is to be performed, a table lookup 
determines whether it is currently available. Because a numeric 
identifier is associated with each calculation, formal identity is 
not required to find a common subexpression (Figure 3). Program-flow 
properties that render a match impossible are reflected 
by assigning new identifiers at statement labels. 
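
A minimal sketch of the table lookup may help. It handles only copies, binary operations, and labels within straight-line code, and it merely reports which computations are redundant. The statement format, the name `redundant_statements`, and the example program are our assumptions and do not reproduce the Cocke and Schwartz formulation.

```python
def redundant_statements(statements):
    """Return the indices of statements whose computation is redundant.

    Each statement is a tuple (target, op, left, right):
      ("c", "=", "x", None)        copy
      ("a", "+", "x", "y")         binary operation on two variables
      (None, "label", None, None)  statement label
    """
    var_number = {}    # variable name -> value number it currently holds
    expr_number = {}   # (op, left number, right number) -> value number
    counter = 0
    redundant = []

    def number_of(name):
        nonlocal counter
        if name not in var_number:
            var_number[name] = counter
            counter += 1
        return var_number[name]

    for index, (target, op, left, right) in enumerate(statements):
        if op == "label":
            # Control may arrive from elsewhere: forget what variables hold,
            # so every variable receives a fresh identifier when next used.
            var_number.clear()
            continue
        if op == "=":
            var_number[target] = number_of(left)
            continue
        key = (op, number_of(left), number_of(right))
        if key in expr_number:
            redundant.append(index)
        else:
            expr_number[key] = counter
            counter += 1
        var_number[target] = expr_number[key]
    return redundant


program = [
    ("c", "=", "x", None),
    ("a", "+", "x", "y"),
    ("b", "+", "c", "y"),           # same value as x + y, different text
    (None, "label", None, None),
    ("d", "+", "x", "y"),           # not redundant: follows a label
]
found = redundant_statements(program)
```

Statement 2 is found redundant even though it is not textually identical to statement 1, because c and x carry the same value number. A complete optimizer would also have to ensure that the earlier result is still held somewhere when it is reused.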

Schneck and Angel (45) have combined the interval analysis 
and value number techniques in an optimizer which accepts and 
produces FORTRAN programs. All of the techniques referred to 
at the beginning of this section are implemented, and new 
optimizations are introduced. Strict ordering of nodes within an 
interval permits the value number technique to be applied 
globally and eliminates virtually all common subexpressions. A 
second pass over the program, in the manner of Cocke and 
Kennedy, permits global constant propagation to be performed. 
The efficacy of optimization at the level of a programming 
language is also discussed. 

A recent paper by Hecht and Ullman (46) introduces a pair 
of transformations which may be used in flow-graph analysis. 
Transformation T1 removes an edge which begins and ends at the 
same node. Transformation T2 condenses node a into its unique 
immediate predecessor b, resulting in a/b. A flow graph is called 
collapsible if, and only if, repeated application of T1 and T2 
results in a single node. Figure 6 indicates the collapsibility of the 
example flow graph. Collapsibility is shown to be equivalent to 
interval reducibility. The time to determine collapsibility is 
O(n log n), while the time to determine interval reducibility may 
be greater. Information obtained by interval analysis may also be 
obtained by application of T1 and T2. 



Figure 6. Collapsibility: Repeated Applications of Two 
Transformations Yield a Single Node in O(n log n) Steps 
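
The two transformations can be sketched directly. The version below is deliberately naive: it rescans the whole graph on every pass and makes no attempt to reach the O(n log n) bound. The name `is_collapsible` and both example graphs are our assumptions.

```python
def is_collapsible(succ, entry):
    """Apply the two transformations until neither applies.

    succ maps every node to an iterable of its immediate successors.
    Returns True if the graph is reduced to a single node.
    """
    graph = {node: set(targets) for node, targets in succ.items()}
    changed = True
    while changed:
        changed = False
        # First transformation: remove an edge that begins and ends
        # at the same node.
        for node in graph:
            if node in graph[node]:
                graph[node].discard(node)
                changed = True
        # Second transformation: condense a node into its unique
        # immediate predecessor. The entry node is never condensed.
        for a in list(graph):
            if a == entry:
                continue
            preds = [b for b in graph if a in graph[b]]
            if len(preds) == 1 and preds[0] != a:
                b = preds[0]
                graph[b].discard(a)
                graph[b] |= graph.pop(a)   # b inherits the edges leaving a
                changed = True
    return len(graph) == 1


# Hypothetical graphs: a single loop with one entry, and the classic
# two-entry loop in which neither transformation can make progress.
single_entry_loop = {1: [2], 2: [3, 5], 3: [4], 4: [2], 5: []}
two_entry_loop = {1: [2, 3], 2: [3], 3: [2]}
results = (is_collapsible(single_entry_loop, 1),
           is_collapsible(two_entry_loop, 1))
```

The first graph collapses to a single node. In the second, nodes 2 and 3 each have two immediate predecessors, so the graph is not collapsible and would require node splitting.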


CONCLUSION 

The powerful architecture-independent optimizations are 
responsible for most of the increased speed obtained by an 
optimizing compiler. Schneck and Angel (45) have shown that 
these optimizations may be applied before compilation and 
achieve almost all that a compiler can. With IBM’s FORTRAN H, 
architecture-independent optimization accounts for 80 percent 
of the speed increase. With the CDC compiler for the 6600, 
external optimization produces code faster than the compiler 
can. In summary, an external architecture-independent compiler, 
supplemented by a machine-oriented compiler, is the most 
cost-effective technique. This is true for the manufacturer, who may 
[...] the compiler, as well as for the programmer, 
who will find debugging easier in this environment because he 
can see what changes have been effected. 

REFERENCES 

1. International Business Machines Corporation, The IBM 
Mathematical Formula Translating System, FORTRAN. 

2. R. C. Miller and B. J. Oldfield, “Producing Computer 
Instructions for the PACT I Compiler,” Journal of the ACM. 

3. P. B. Sheridan, “The Arithmetic Translator-Compiler of 
the IBM FORTRAN Automatic Coding System,” Communications 
of the ACM, 2, 1959. 

4. W. M. McKeeman, “Peephole Optimization,” Communications 
of the ACM, 8, 1965. 

5. J. T. Bagwell Jr., “Local Optimizations,” ACM SIGPLAN 
Notices, 5, No. 7, 1970. 

6. J. P. Anderson, “A Note on Some Compiling Algorithms,” 
Communications of the ACM, 7, 1964. 

7. I. Nakata, “On Compiling Algorithms for Arithmetic 
Expressions,” Communications of the ACM, 10, 1967. 

8. R. R. Redziejowski, “On Arithmetic Expressions and 
Trees,” Communications of the ACM, 12, 1969. 

9. V. Schneider, “On the Number of Registers Needed To 
Evaluate Arithmetic Expressions,” BIT, 11, 1971. 

10. M. Finkelstein, “A Compiler Optimization Technique,” The 
Computer Journal, 11, 1968. 

11. B. Randell and L. J. Russell, “Single-scan Techniques for 
the Translation of Arithmetic Expressions in ALGOL 60,” 
Journal of the ACM, 11, 1964. 

12. R. W. Allard, K. A. Wolf, and R. A. Zemlin, “Some Effects 
of the 6600 Computer on Language Structures,” 
Communications of the ACM, 7, 1964. 

13. J. F. Thorlin, “Code Generation for PIE (Parallel Instruction 
Execution) Computers,” Proceedings of the Spring 
Joint Computer Conference, 1967. 

14. International Business Machines Corporation, TR 00.2240, 
Current Technologies in FORTRAN Object Code 
Optimization. D. Blum, S. K. Brown, A. G. Calavano, 
H. O. Hempy, and J. Suez, 1971. 

15. H. S. Stone, "One-Pass Compilation of Arithmetic Expres- 
sions for a Parallel Processor,” Communications of the ACM, 
1967. 

16. J. C. Han, Tree Height Reduction for Parallel Processing of 
Blocks of FORTRAN Assignment Statements, National 
Technical Information Service, PB-207985, 1972. 

17. C. V. Ramamoorthy and M. J. Gonzalez, "Subexpression 
Ordering in the Execution of Arithmetic Expressions,” 
Communications of the ACM, 14, 1971. 

18. W. H. Burkhardt, “Automation of Program Speed-Up on 
Parallel-Processor Computers,” Computing, 3, 1968. 

19. R. E. Milstein, Compiler Design for the ILLIAC IV, 
National Technical Information Service, AD 719417, 1971. 

20. L. Lamport and D. Presberg, Concurrent Compiling, Na- 
tional Technical Information Service, AD 742279, 1972. 

21. P. B. Schneck, “Automatic Recognition of Parallel and 
Vector Operations in a Higher Level Language,” Proceedings 
of the ACM National Conference, 1972. 

22. D. J. Kuck, C. Muraoka, and S. C. Chen, “On the Number 
of Operations Simultaneously Executable in FORTRAN- 
Like Programs and Their Resulting Speedup,” IEEE Trans- 
actions on Computers, C-21, 1972. 

23. J. W. Backus, et al., “The FORTRAN Automatic Coding 
System,” Proceedings of the Western Joint Computer Conference, 1957. 

24. L. P. Horwitz, R. M. Karp, R. E. Miller, and S. Winograd, 
“Index Register Allocation,” Journal of the ACM, 13, 1966. 

25. F. Luccio, “A Comment on Index Register Allocation,” 
Communications of the ACM, 10, 1967. 

26. W. H. E. Day, “Compiler Assignment of Data Items to 
Registers,” IBM Systems Journal, 9, 1970. 

27. R. T. Prosser, "Applications of Boolean Matrices to the 
Analysis of Flow Diagrams,” Proceedings of the Eastern 
Joint Computer Conference, 1959. 

28. S. Warshall, “A Theorem on Boolean Matrices,” Journal of 
the ACM, 9, 1962. 

29. C. V. Ramamoorthy, “Analysis of Graphs by Connectivity 
Considerations,” Journal of the ACM, 11, 1964. 


30. F. E. Allen, “Program Optimization,” Annual Review of 
Automatic Programming, New York: Pergamon Press, 1969. 

31. R. L. Kleir and C. V. Ramamoorthy, “Optimization Strate- 
gies for Microprograms,” IEEE Transactions on Computers, 
C-20, 1971. 

32. D. Gries, Compiler Construction for Digital Computers, 
New York: John Wiley and Sons, 1972. 

33. J. T. Ryan, “A Direction-Independent Algorithm for Deter- 
mining the Forward and Backward Compute Point for a 
Term or Subscript During Compilation,” The Computer 
Journal, 9, 1966. 

34. International Business Machines Corporation, TR 00.1330, 
Global Program Optimization, C. W. Medlock and 
E. S. Lowry, 1965. 

35. E. S. Lowry and C. W. Medlock, “Object Code Optimization,” 
Communications of the ACM, 12, 1969. 

36. V. A. Busam and D. E. Englund, “Optimization of Expressions 
in FORTRAN,” Communications of the ACM, 12, 
1969. 

37. A. P. Yershov, “ALPHA - An Automatic Programming 
System of High Efficiency,” Journal of the ACM, 13, 1966. 

38. A. P. Yershov, The ALPHA Automatic Programming Sys- 
tem, New York: Academic Press, 1971. 

39. J. Cocke and J. T. Schwartz, Programming Languages and 
Their Compilers, New York: New York University, 1969. 

40. J. Cocke and R. E. Miller, “Some Analysis Techniques for 
Optimizing Computer Programs,” Proceedings of the Second 
Hawaii International Conference on System Sciences, 
1969. 

41. F. E. Allen, “Control Flow Analysis,” ACM SIGPLAN 
Notices, 5, 1970. 

42. J. Cocke, “Global Common Subexpression Elimination,” 
ACM SIGPLAN Notices, 5, 1970. 

43. F. E. Allen and J. Cocke, “Graph Theoretic Constructs for 
Program Control Flow Analysis,” Unpublished paper. 

44. K. Kennedy, “A Global Flow Analysis Algorithm,” Inter- 
national Journal of Computer Mathematics, 3, 1971. 

45. P. B. Schneck and E. Angel, “A FORTRAN to FORTRAN 
Optimizing Compiler,” The Computer Journal, to be 
published in 1973. 

46. M. S. Hecht and J. D. Ullman, “Flow Graph Reducibility,” 
SIAM Journal of Computing, 1, 1971. 





